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Abstract 

Background: Nonparametric Bayesian techniques have been developed recently to extend the sophistication of 
factor models, allowing one to infer the number of appropriate factors from the observed data. We consider such 
techniques for sparse factor analysis, with application to gene-expression data from three virus challenge studies. 
Particular attention is placed on employing the Beta Process (BP), the Indian Buffet Process (IBP), and related 
sparseness-promoting techniques to infer a proper number of factors. The posterior density function on the model 
parameters is computed using Gibbs sampling and variational Bayesian (VB) analysis. 

Results: Time-evolving gene-expression data are considered for respiratory syncytial virus (RSV), Rhino virus, and 
influenza, using blood samples from healthy human subjects. These data were acquired in three challenge studies, 
each executed after receiving institutional review board (IRB) approval from Duke University. Comparisons are 
made between several alternative means of per-forming nonparametric factor analysis on these data, with 
comparisons as well to sparse-PCA and Penalized Matrix Decomposition (PMD), closely related non-Bayesian 
approaches. 

Conclusions: Applying the Beta Process to the factor scores, or to the singular values of a pseudo-SVD 
construction, the proposed algorithms infer the number of factors in gene-expression data. For real data the "true" 
number of factors is unknown; in our simulations we consider a range of noise variances, and the proposed 
Bayesian models inferred the number of factors accurately relative to other methods in the literature, such as 
sparse-PCA and PMD. We have also identified a "pan-viral" factor of importance for each of the three viruses 
considered in this study. We have identified a set of genes associated with this pan-viral factor, of interest for early 
detection of such viruses based upon the host response, as quantified via gene-expression data. 



must address the "large p, small n" problem [1], in 
which often n « p. Therefore, to yield reliable infer- 
ence, one must impose strong restrictions on the form 
of the model. 

When developing regression and classification models 
for gene-expression data, a widely employed assumption 
(restriction) is that the model parameters are sparse, 
implying that only a small subset of the genes are impor- 
tant for prediction. If only a small set of genes (« p) are 
responsible for differences in disease groups, then reliable 
inference may often be performed even when n « p. 
Example approaches that have taken this viewpoint are 
lasso [2], the elastic net [3], and related Bayesian 
approaches [4]. In fact, sparse regression and 
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I. Background 

When performing gene-expression analysis for inference 
of relationships between genes and conditions/pheno- 
types, one typically must analyze a small number of 
samples, each composed of expression values from tens 
of thousands of genes. In this setting the observed data 
is X e R pxn , where each column corresponds to one of 
n samples, quantifying the associated gene-expression 
values for all p genes under investigation. We typically 
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classification algorithms are widely used in many statis- 
tics and machine-learning applications, beyond gene ana- 
lysis [5-7]. 

An important research direction for gene-expression 
analysis, and many other applications, involves the use 
of factor models [8-11]. To address the "large p, small 
n" problem, sparseness is again imposed, now typically 
on the factor loadings. Specifically, in an unsupervised 
setting the data are assumed to satisfy 

X = AS + E (1) 

where A e R pxr , S e R rxn and E e ; ' x : '; if covari- 
ates are available they may also be considered in the 
model [11], with none assumed here. Note that here 
and henceforth we assume that the gene-expression data 
are centered in advance of the analysis; otherwise, there 
should be an intercept added to the model. Considering 
the /th sample, Xj , corresponding to the /th column of 
X, the model states that Xj = Asj + ej , where Sj and e,- 
are the /'th columns of S and E, respectively. 

The columns of A represent the factor "loadings", and 
rows of S are often called factors. To address the fact 
that n « p, researchers have typically imposed a sparse- 
ness constraint on the columns of A [11], with the idea 
that each column of A should ideally (in the gene appli- 
cation) correspond to a biological "pathway", which 
should be defined by a relatively small number of corre- 
lated genes. Within Bayesian formalisms, the sparse col- 
umns of A are typically imposed via spike-slab-like 
priors [1], [11], or alternatively via shrinkage (e.g., Stu- 
dent-t [6]) priors. Several non-Bayesian approaches have 
also been introduced, including sparse-PCA [12] and the 
related Penalized Matrix Decomposition (PMD) [13]. 

A problem that is receiving increased attention in fac- 
tor-analysis-based approaches is a means of defining an 
appropriate number of factors {i.e., to infer r). The non- 
Bayesian approaches are often sequential, and one may 
infer r by monitoring the error ||£||f as a function of 
iteration number [12], [13]. In many previous Bayesian 
approaches r has just been set [11], and presumably 
many non-biologically-relevant factor loadings are 
inferred. A computationally expensive reverse-jump 
MCMC approach has been developed [14], with compu- 
tational efficiency improved in [15] while also consider- 
ing a default robust prior specification. Perhaps the 
most widely employed approach [16-18] for choosing r 
is the Bayesian information criteria (BIC). A disadvan- 
tage is that conditioning on a fixed choice of the num- 
ber of factors ignores uncertainty and the BIC is not 
well justified in hierarchical models, as the number of 
parameters is unclear. 

There has been recent interest in applying nonpara- 
metric Bayesian methods [8], [9] to infer r (in fact, a 



posterior distribution on r), based on the observed data 
X. An example of recent research in this direction 
employs the Indian Buffet Process (IBP) [19], [20]. In 
this paper we also consider the Beta Process (BP), recog- 
nizing that the BP and IBP are closely linked [21], [22]. 

For data sets with very large p (e.g., 10,000 or more), 
computational efficiency is of major practical impor- 
tance. In previous use of nonparametric Bayesian meth- 
ods to this problem, a Gibbs sampler has typically been 
employed [11]. The BP-based formulation admits a rela- 
tively simple variational Bayesian (VB) [23] approxima- 
tion to the posterior, which is considerably faster than 
Gibbs sampling. An advantage of a VB analysis, in addi- 
tion to speed, is that convergence is readily monitored 
(for the Gibbs sampler there are typically challenges 
when assessing convergence). We perform a comparison 
of the difference in inferred model parameters, based on 
VB and Gibbs analysis. 

The specific data on which the models are employed 
correspond to gene-expression data from recent viral chal- 
lenge studies. Specifically, after receiving institutional 
review board (IRB) approvals from Duke University, we 
performed three separate challenge studies, in which indi- 
viduals were inoculated with respiratory syncytial virus 
(RSV), Rhino virus, and influenza. Informed consent was 
used in all studies. Using blood samples collected sequen- 
tially over time, we have access to gene-expression data at 
pre-inoculation, just after inoculation, and at many addi- 
tional time points up to the point of full symptoms (such 
data were collected on all subjects, although not all 
became symptomatic). Using these data, we may investi- 
gate time-evolving factor scores of samples, to examine 
how the response to the virus evolves with time. Of parti- 
cular importance is an examination of the factors of 
importance for individuals who became symptomatic rela- 
tive to those who did not. In the factor analysis we con- 
sider data individually for each of the three viruses (at all 
times), as well as for all three viruses in a single analysis 
(seeking pan- viral factors). Results are generated based on 
nonparametric Bayesian approaches to factor analysis, 
employing the Beta Process, the Indian Buffet Process, and 
a related sparseness-constrained pseudo-SVD construction 
(a Bayesian construction of sparse-PCA [12]). We also 
make comparisons to the non-Bayesian Penalized Matrix 
Decomposition (PMD) [13]. 

II. Results 

A. Brief summary of models 

We first provide a brief intuitive explanation of the 
workings of the different Bayesian models considered. 
These models are built around the Indian buffet process 
(IBP) [19], so named for the following reason. In the 
factor model of (1), the columns of A represent factor 
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loadings in which the gene-expression values for sample 
j are expressed: x t = Asj + e, . One construction of the 
IBP constitutes a set of candidate columns of A, and 
these are termed "dishes" at an Indian "buffet". Each of 
the n samples {xj\ t = i jfI correspond to "customers" at 
the buffet; each customer selects a subset of dishes from 
the buffet {i.e., selects a subset of candidate columns of 
A). The IBP is constructed such that the more a particu- 
lar dish (column of A) is used by a subset of customers 
= i.w the more probable it is that it will be used by 
other customers. Thus, the IBP imposes the idea that 
many of the samples {xj\j = l H will utilize the same sub- 
set of columns of A, but each sample may also utilize 
idiosyncratic factor loadings, representing unique char- 
acteristics of particular samples. The IBP construction 
does not impose a total number of factors for the data 
{Xj\i = i,n> with this number inferred by the analysis. 
Thus, the IBP is a natural Bayesian method for inferring 
the number of factors appropriate for representing all 
observed data {xj\j = i ; „. A convenient means of imple- 
menting the IBP employs the Beta process (BP) [21]. 

There are multiple ways in which one may utilize the 
IBP/BP within the factor model, with three such meth- 
ods considered here: (i) the BP is applied to the factor 
scores S (termed below the BP construction), (ii) the 
IBP is employed on the factor loadings A [8] (termed 
below the IBP construction), and (Hi) a BP-like con- 
struction is employed to implement a Bayesian con- 
struction of a singular-value decomposition of X 
(termed below the pseudo-SVD construction). To realize 
the approximate posterior density function for the para- 
meters of these models, we have considered both 
MCMC and VB computational methods. The specifics 
of the BP, IBP and pseudo-SVD methods, as well as 
computational details, are provided in Section IV. 

B. Synthesized Data 

The first validation example we considered was taken 
from [8]. In this example the gene-factor connectivity 
matrix of an E-coli network is employed to generate a 
synthetic dataset having 100 samples of 50 genes and 8 
underlying factors. The data had additive white Gaussian 
noise with a signal-to-noise-ratio of 10. For this very 
small-scale example we considered all three Bayesian 
methods (BP, IBP and pseudo-SVD); in each case we 
considered both MCMC and VB methods for inferring 
the posterior density function. We also considered the 
non-Bayesian PMD and sparse-PCA [13], [24]. All meth- 
ods performed well in uncovering the proper number of 
factors, and in capturing the proper genes associated 
with each factor. For brevity we do not provide further 
details on this example. While it is worthy of considera- 
tion because it was considered in related published 
research [8], its small-scale nature (only 50 genes) 



makes it less relevant for the large-scale real application 
we consider below. Therefore, in the next synthetic 
example we consider a much larger-scale problem, and 
consequently for that problem we were unable to test 
against the IBP method. 

The synthetic data were generated as follows. A total of 
p = 10, 000 features ("genes") are employed, and the 
expression value for these p genes was constituted using 
five factors (r = 5) plus a noise term E (i.e., via the model 
in (1)). For each of the five factors, a unique set of 50 
genes were selected and were given a factor-loading value 
of one. In addition, ten more genes were selected, with 
these shared among all five factors (again with unit- 
amplitude contribution to the factor loadings). Thus, a 
total of 260 genes contributed non-zero loadings to at 
least one of the five factors. For all other genes the fac- 
tor-loading contribution was set to zero. The above con- 
struction defines the sparse matrix A in (1). The 
components of S e R rx ", for n = 150 samples, are drawn 
i.i.d. from J\f (0, 1). The elements of the noise matrix E 
are drawn i.i.d. from A/^Carj 1 ) • The data X were then 
utilized within the various factor-analysis models, with 
the data-generation process repeated 100 independent 
times (100 different X), with mean and standard-devia- 
tion results presented on the inferred model parameters 
(discussed below), based on the 100 runs. 

We consider a range of noise variances l/a 0 to consti- 
tute E, to address model performance as a function of 
the signal-to-noise ratio (SNR). As one definition of 
SNR, one may consider the average energy contributed 
from a non-zero gene to a particular factor, relative to 
the energy in the noise contribution for that gene, from 
E. Based on the fact that the non-zero components of A 
have unit amplitude, and the components of S are 
drawn from J\f (0, 1), on average (across many sam- 
ples) the energy contributed by a non-zero gene to a 
particular factor is one. The average noise energy con- 
tributed to each gene is l/a 0 . Hence, the ratio of these 
two quantities, a 0 , may be considered as a measure of 
SNR. Other measures of SNR may be defined with 
respect to this model, each of which will be defined in 
terms of a 0 . 

In Figure 1 are presented the average number of 
inferred factors and the associated standard deviation on 
this number, for the BP and pseudo-SVD models. We 
also compare to the sparse-PCA model in [12]. The 
integer K represents the truncation level in the models, 
defining the maximum number of columns of A consid- 
ered for analysis, from which r < K columns are inferred 
as needed to represent the data X. This is discussed in 
detail in Section IV. In these examples the models were 
each truncated to K = 30 factors. Consequently, when 
30 factors are used, the models have effectively failed, 
since the true number of factors is 5 and 30 is the 
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Figure 1 Number of inferred factors for various algorithms, as applied to synthesized data (for which there are five factors used to 
generate the data). The data were generated randomly 100 times, with mean and standard deviation depicted. The horizontal axis denotes 
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maximum allowed within the model, given the trunca- 
tion level under consideration. The MCMC results are 
based upon 2000 burn-in iterations and 1000 collection 
iterations (the results are similar when 10,000 collection 
iterations are employed). Results are shown as a func- 
tion of the standard deviation of the noise, 1 / ^[a^ . 

The sparse-PCA model works well up to the point that 
the noise variance equals the amplitude of the non-zero 
values in A (approximate SNR of one), while most of 
the Bayesian methods infer the proper number of fac- 
tors to a higher level of noise. 

In Figure 2 we examine how meaningful the inferred 
factor loadings are. Specifically, recall that the data are 
based upon 260 unique genes that contribute to the fac- 
tor loadings. Based on the inferred factor loadings, we 
rank the genes based upon their strength in the load- 
ings. We then rank the genes from 1 to 260, based on 
the above strength, and examine the percentage of the 
top 260 inferred genes are consistent with truth. Consid- 
ering Figure 2, all of the Bayesian methods perform well 
in this task, up to a noise standard deviation of approxi- 
mately 1.3, while sparse-PCA performs degrades quickly 
beyond standard deviations of one (for SNR values 
below one). Note that we also consider the Bayesian fac- 
tor analysis model in [11]; we did not consider this 



method in Figure 1 because it does not have a mechan- 
ism for estimating r-we simply set r = K in this analysis, 
using the same K = 30 as employed for the other Baye- 
sian methods. In [11] the authors only considered an 
MCMC implementation, where here we consider both 
MCMC and VB inference for this model; further, here 
we have employed a Student-t prior on the components 
of the factor loading matrix A, where in [11] a spike- 
slab prior was employed. 

Concerning sparse-PCA [12] (and PMD, not shown), 
every effort was made to optimize the model parameters 
for this task. Our experience is that, while sparse-PCA and 
PMD [13] are very fast algorithms, and generally quite 
effective, they are not as robust to noise as the Bayesian 
methods (the Bayesian methods are also less sensitive to 
parameter settings). It is possible that the sparse-PCA and 
PMD results could be improved further if the model para- 
meters are optimized separately for each noise level (and 
the Bayesian results may also be improved with such tun- 
ing). However, the model parameters were fixed for all 
noise variances considered (since the noise variance is 
often not known a priori, with the sparse-PCA carefully 
tuned to achieve the best results in such a circumstance. 

We also performed additional simulated examples of 
the type discussed above, the details of which are 
omitted for brevity. In those experiments the different 
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Figure 2 Considering the same data as in Figure 1, for which 260 genes had non-zero contributions to the factor loadings used for 
data generation, we plot the percentage of the inferred most important 260 genes that are consistent with the true genes used for 
data generation. A value of 100% implies that all of the inferred top-260 genes are consistent with those used for data generation. The data 
were generated randomly 100 times, with mean and standard deviation depicted. 



genes did not have the same noise variance. The Baye- 
sian methods, which as indicated above infer the noise 
variance separately for each gene, performed as well as 
in Figures 1 and 2. However, the sparse-PCA and PMD 
models performed relatively poorly in this case, since 
they assume the same noise variance for all genes. The 
assumption of a constant noise variance for each gene 
may not be as appropriate for real data. 

C. Details on Data Collections for Three Viral Challenge 
Studies 

We considered three cohorts of healthy volunteers 
experimentally infected with either rhinovirus, respira- 
tory syncytial virus (RSV) or influenza A; these three 
challenges were performed separately, with no overlap 
in the subjects. All exposures were approved by the 
Duke University institutional review board and con- 
ducted according to the Declaration of Helsinki. The 
three challenges are briefly summarized here, with 
further details provided in [25]. 
Human Rhinovirus cohort 

We recruited 20 healthy volunteers via advertisement to 
participate in the rhinovirus challenge study through an 
active screening protocol at the University of Virginia 
(Charlottesville, VA). On the day of inoculation, 10 
TCID50 GMP rhinovirus (Johnson and John-son) was 
inoculated intranasally. Subjects were admitted to the 



quarantine facility for 48 hours following rhinovirus 
inoculation and remained in the facility for 48 hours fol- 
lowing inoculation. Blood was sampled into PAXGen- 
e™blood collection tubes at pre-determined intervals 
post inoculation. Nasal lavage samples were obtained 
from each subject daily for rhinovirus titers to accu- 
rately gauge the success and timing of the rhinovirus 
inoculation. Following the 48th hour post inoculation, 
subjects were released from quarantine and returned for 
three consecutive mornings for sample acquisition and 
symptom score ascertainment. 
Human RSV cohort 

A healthy volunteer intranasal challenge with RSV A 
was performed in a manner similar to the rhinovirus 
intranasal challenge. The RSV challenge was performed 
at Ret-roscreen Virology, Ltd (Brentwood, UK) using 20 
pre-screened volunteers who provided informed con- 
sent. On the day of inoculation, a dose of 10 4 TCID50 
respiratory syncytial virus (RSV; serotype A) manufac- 
tured and processed under current good manufacturing 
practices (cGMP) by Meridian Life Sciences, Inc. (Mem- 
phis, TN USA) was inoculated intranasally per standard 
methods. Blood and nasal lavage collection methods 
were similar to the rhinovirus cohort, but continued 
throughout the duration of the quarantine. Due to the 
longer incubation period of RSV A, subjects were not 
released from quarantine until after the 165th hour and 



Chen et al. BMC Bioinformatics 2010, 11:552 
http://www.biomedcentral.eom/1 471-21 05/1 1 /552 



Page 6 of 16 



were negative by rapid RSV antigen detection (Binax- 
Now Rapid RSV Antigen; Inverness Medical Innova- 
tions, Inc). 
Influenza cohort 

A healthy volunteer intranasal challenge with influenza 
A/Wisconsin/67/2005 (H3N2) was performed at Retro- 
screen Virology, LTD (Brentwood, UK), using 17 pre- 
screened volunteers who provided informed consent. 
On the day of inoculation, a dose of 106 TCID50 Influ- 
enza A manufactured and processed under current good 
manufacturing practices (cGMP) by Bayer Life Sciences, 
Vienna, Austria was inoculated intranasally per standard 
methods at a varying dose (1:10, 1:100, 1:1000, 1:10000) 
with four to five subjects receiving each dose. Due to 
the longer incubation period of influenza as compared 
to rhinovirus, subjects were not released from quaran- 
tine until after the 216th hour. Blood and nasal lavage 
collection continued throughout the duration of the 
quarantine. All subjects received oral oseltamivir (Roche 
Pharmaceuticals) 75 mg by mouth twice daily prophy- 
laxis at day 6 following inoculation. All patients were 
negative by rapid antigen detection (BinaxNow Rapid 
Influenza Antigen; Inverness Medical Innovations, Inc) 
at time of discharge. 

For each viral challenge, subjects had samples taken 
24 hours prior to inoculation with virus (baseline), 
immediately prior to inoculation (pre-challenge) and at 
set intervals following challenge. For the rhinovirus chal- 
lenge, peripheral blood was taken at baseline, then at 4 
hour intervals for the first 24 hours, then 6 hour inter- 
vals for the next 24 hours, then 8 hour intervals for the 
next 24 hours, and then 24 hour intervals for the 
remaining 3 days of the study. For the RSV and influ- 
enza challenges, peripheral blood was taken at baseline, 
then at 8 hour intervals for the initial 120 hours, and 
then 24 hours for the remaining 2 days of the study. All 
results presented here are based on gene-expression 
data from blood samples. For the RSV and Rhino virus 
cases not all blood samples were converted to gene 
expression values, as a cost-saving measure. Hence, for 
these two cases the gene expression data are not 
sampled as finely in time as are the influenza data. 

In the statistical analysis, the matrix X in (1) has col- 
umns that correspond to the n samples; n = n s n t , with 
n s representing the number of subjects and n t the num- 
ber of sample time points. We do not impose a prior on 
the time-dependence of the factors scores, and uncover 
this time dependence via the inferred posterior distribu- 
tion of factor scores S. 

D. Analysis of influenza data 

The gene-expression data consisted of over p = 12, 000 
genes, and consequently we found that the IBP 
approach developed in [8] was computationally 



intractable. We found that the VB and MCMC results 
were generally in good agreement for this real data, and 
therefore the two very distinct computational tools 
served to cross-validate each other. The VB and MCMC 
computations also required similar CPU time (for the 
number of Gibbs iterations considered); while the VB 
analysis required far fewer iterations to converge, each 
iteration is significantly more expensive than that asso- 
ciated with the Gibbs sampler. 

For brevity, we here focus exclusively on MCMC solu- 
tions when considering Bayesian analysis. Results are 
presented using the BP and pseudo-SVD methods, as 
well as via PMD [13] (similar results were computed 
using sparse-PCA [24]). We note that the design of each 
the experiments involves samples from the same sub- 
jects observed at multiple time points (with different 
subjects for the three viruses). Therefore, the assump- 
tion within the models that the samples at different 
times are statistically independent may warrant reconsi- 
dering in future studies. This subject has been consid- 
ered in related work [26], although that research 
assumes a known factor structure and Gaussian latent 
factors. 

We first consider results based on the BP as applied to 
the factor scores. In these results we set K = 30 (recall 
this is the truncation level on the number of factors), 
and inferred approximately r = 13 important factors (see 
Figure 3); although only approximately r = 13 factors 
are used, we show the factor scores for all K = 30 possi- 
ble factors such that the sparseness of the unused fac- 
tors is evident, as inferred via the posterior. The results 
in Figure 3 correspond to one example (representative) 
collection sample from the Gibbs sampler; Factor 1, 
which is most closely tied to the symptomatic/asympto- 
matic response, is employed by all data, while other fac- 
tors are used more idiosyncratically {e.g., Factors 3 and 
14 are only used by a small subset of the data samples; 
see the detailed discussion of the model in the Methods 
section). 

At each time point, there are data from 17 subjects 
(the same individuals were sampled at a sequence of 
times). The horizontal axis in Figure 3 corresponds to a 
sequence of groups of data, proceeding in time from 
inoculation, with generally 17 samples per time point 
(all data will be released for other investigators to 
experiment with). The blue points correspond to sam- 
ples of individuals who eventually became symptomatic, 
and the red points to asymptomatic individuals. 

The vertical axis in these plots corresponds to the fac- 
tor score associated with the respective sample. We 
observe in Figure 3 that Factor 1 (the factor indexing is 
arbitrary) provides a clear discriminator of those who 
will become symptomatic, particularly as time proceeds 
(note that the model is completely unsupervised, and 
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Figure 3 Factor-analysis of Flu data with BP applied within design of factor scores, as discussed in Section IV-B The MCMC inference 
was based on 2000 burn-in iterations and 500 collection iterations, and factor scores are depicted for one (typical) collection sample from the 
Gibbs sampler. Approximately thirteen factors were inferred with non-zero factor scores (shown at right), and at left is a blow-up of the factor 
that most separates symptomatic (blue) from asymptomatic (red) samples. The horizontal axis denotes time in hours. The data were collected in 
groups, at discrete times; the results at a given time are shifted slightly along the horizontal axis with respect to one another, to enhance 
readability. 



therefore this discriminating power was uncovered with- 
out using label information). 

Having introduced the form of the data presentation, 
we now present results using the pseudo-SVD method 
and PMD; for the pseudo-SVD method we again show 
one (typical) sample from the Gibbs collection samples, 
while for PMD the results are the single solution. In 



Figures 4 and 5 we present results, respectively, for the 
Bayesian pseudo-SVD model and for PMD [13]. For the 
Bayesian methods we again set K = 30. Both methods 
uncover a relatively small (less than K) number of rele- 
vant factors. 

Note that in each case there appears to be one factor 
that clearly distinguishes symptomatic vs. asymptomatic, 
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Figure 4 Factor-analysis of Flu data with Bayesian pseudo-SVD applied within design of factor scores, applied to the Flu 

are presented in the same form as Figure 3. 
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Figure 5 Factor-analysis of Flu data with PMD ^applied within design of factor scores, applied to the Flu data Blue points correspond 
to samples of individuals who ultimately become symptomatic, and the red points correspond to asymptomatic samples. 



particularly as time increases. Upon examining the impor- 
tant genes in each of these factors, one recognizes a high 
level of overlap (suggesting consistency between the differ- 
ent methods). Further discussion of the associated genes 
and their biological significance is provided in [25] . 

E. Pan-viral factors 

We now consider a "pan-viral" analysis, in which data 
from all three viruses are analyzed jointly. For further con- 
ciseness, for this example we only present results for the 
BP applied to the factor scores; similar results were 
obtained with the Bayesian pseudo-SVD framework and 
by PMD. 

Since three viruses are now considered jointly, we 
have increased K to K = 60 in this example, and now 
approximately 46 factors were inferred (with non-zero 
factor scores). Considering Figure 6, we note that Factor 
20 provides good discrimination between the sympto- 
matic (blue) and asymptomatic (red) samples, with this 
factor examined more closely in Figure 7. This same fac- 
tor is able to distinguish the samples of each virus, at 
sufficient time after inoculation (a single "pan-viral" fac- 
tor has been inferred, able to separately distinguish 
symptomatic vs. asymptomatic for each of the three 
viruses considered). Factor 19 in Figure 6 also appears 
to provide separation between symptomatic and asymp- 
tomatic samples; however, this is manifested because it 
contains two genes that are highly discriminative (SERP- 
ING1 and TNFAIP6), with most of the other genes in 
Factor 19 not discriminative. When addressing biologi- 
cal significance in [25], the focus is on Factor 20 in 
Figure 6, as it contains numerous discriminative genes. 
In these figures we are again showing one (typical) sam- 
ple from the Gibbs collection. 



It is also of interest to consider Factors 1 and 2 in 
Figure 6. Each of the samples from the individual 
viruses is offset by a distinct amplitude, almost entirely 
independent of whether the sample was symptomatic 
or asymptomatic. This phenomenon associated with 
Factors 1 and 2 in Figure 6 is attributed to challenge- 
study-dependent offsets in the data (the gene-expres- 
sion values were obtained separately for each of these 
studies, and the data normalized separately), which 
account for different normalizations of the data 
between the three distinct viral challenges. This under- 
scores that not all factors have biological significance, 
with some a consequence of the peculiarities of gene- 
expression data (study-dependent offsets in normaliza- 
tion). The other factor-analysis methods (omitted here 
for brevity) produced very similar normalization- 
related factors. 

In Figure 8 are depicted the important genes asso- 
ciated with the discriminative pan-viral Factor 20 in Fig- 
ure 6. It is a subject of further research, but based on 
the data analyzed thus far, it appears the FA model 
applied to gene-expression data cannot distinguish well 
between the different viruses. However, we have applied 
FA jointly to our pan-virus data and to bacterial data 
available from related but distinct studies [27]. From 
that analysis we are able to distinguish between viral- 
based phenotypes and bacteria-based phenotypes; this is 
discussed in greater detail in [25]. 

We have here identified many genes that are inferred 
to be connected with the viruses under study. It has 
been observed, by the medical doctors on our research 
team, that the inferred genes are closely aligned with 
relevant known pathways, with this discussed in detail 
in [25]. 



Chen et al. BMC Bioinformatics 2010, 11:552 
http://www.biomedcentral.eom/1 471 -2 1 05/1 1 /552 



Page 9 of 16 



Factor 1 




Factor 2 Factor 3 Factor 4 

rtHHt i i ii * - - 



Factor 5 



Factor 6 



200 400 

Factor 9 



200 400 

Factor 10 



JttM^JMlfi i 'i n wJtHlBlfttfltWii 1 1 1 1 i* a i J n MlijMw^akkdMH 

' . ^^^^^^ 1 . wmmm> **^ ^^^^PPIWW 



0 200 400 

Factor 13 



0 200 400 

Factor 19 



0 200 400 

Factor 25 



jflO 



U 05 

o 

■0.05, 



0 200 400 

Factor 31 



n ? , 
■u J L 



0 200 400 600 

Factor 37 



0 05 

o 

-0 05 



0 200 400 

Factor 43 



0.05 
0 

AOS 



0 200 400 

Factor 49 



o 

' 1 0 200 400 

Factor 55 



600 



200 400 600 



J 05 
0 

-0.05 



0.1 
0 
-0.1 



0.05 
0 

•0.05 



200 400 600 

Factor 14 



200 400 



200 400 600 

Factor 20 



200 400 

Factor 26 



iio 



200 400 

Factor 32 



200 400 

Factor 38 



200 400 

Factor 44 



200 400 

Factor 50 



200 400 600 

Factor 56 



200 400 600 

Factor 15 




Figure 6 Factor-analysis performed jointly to the Flu, RSV and Rhino data, with BP applied within design of factor scores, as discussed 
in Section IV-B. Results are presented in the same form as Figure 3; the first 220 samples correspond to the Flu data, the next 210 samples 
correspond to Rhino virus, and the remaining samples correspond to RSV. 



III. Conclusions 

We have examined two distinct but related objectives. 
First, in the context of Bayesian factor analysis, we have 
examined three ways of inferring an appropriate number 
of factors. Each of these methods is based on a different 
means of leveraging the utility of the Beta Process, and 
the closely related Indian Buffet Process (IBP). In the 
context of such models, we have examined inference 
based on variational Bayesian analysis, and based on a 
Gibbs sampler. We have also compared these Bayesian 
approaches to state-of-the-art non-Bayesian factor 
models. 

The second contribution of this paper is the introduc- 
tion of a new set of gene-expression data, from three 
time-evolving viral challenge studies. These data allow 
one to examine the time-evolution of Rhino virus, RSV 
and Influenza-A. In addition to the gene-expression 
data, we have also recorded clinical symptom scores, to 
which the gene-expression analysis may be compared. 



With the limited space available here, we have presented 
results on the Influenza data alone, and for all three 
viruses together (a "pan-viral" analysis). 

Based on this study, we may make the following obser- 
vations. For the number of Gibbs iterations deemed 
necessary, the VB and MCMC inference approaches 
required comparable computation time (VB was slightly 
faster, but not substantially). Although VB requires far 
fewer iterations (converges typically in 50 iterations), 
each VB iteration is significantly more expensive than 
that associated with MCMC. The advantage of using 
these two very distinct computational methods on the 
models considered is that they serve to cross-validate 
each other (providing confidence in the results, when 
these two very different methods agree, as they generally 
did in the studies considered). 

Of the three methods of inferring the number of fac- 
tors, the IBP applied to the factor loadings works well 
for small-scale problems, but it is computationally 
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times; the results at a given time are shifted slightly along the horizontal axis with respect to one another, to enhance readability. 
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intractable for the large-scale viral data considered here. 
Applying the Beta Process to the factor scores, or to the 
singular values of a pseudo-SVD construction, yields 
reliable and high-quality results. 

It is not our purpose to provide a detailed (perhaps 
philosophical) discourse on the relative merits of Baye- 
sian and non-Bayesian approaches. However, we 
observed that the non-Bayesian Penalized Matrix 
Decomposition (PMD) yielded very high-quality results, 
as long as the model parameters were set carefully via 
cross-validation; very similar phenomenon was observed 
for the closely related sparse-PCA. Both PMD and 
sparse-PCA infer an appropriate number of factors, but 
one must very carefully set the stop criterion. Since 
PMD and sparse-PCA are much faster than the Bayesian 
approaches, perhaps a good compromise is to use the 
output of these models to initialize the Gibbs sampler in 
a Bayesian solution (this is a subject for future research). 

Concerning the viral data, it was observed that all 
methods were able to infer a factor that was capable of 
distinguishing those subjects who would become symp- 
tomatic from those who would not. It was possible to 
infer a "pan-viral" factor, that was discriminative for all 
viruses considered. 



The evolution of the factor scores tracked well the 
recorded clinical symptom scores. Further, for the dis- 
criminative factor, there was a good association between 
the genes inferred as important and the associated biol- 
ogy [25] (with interpretation provided by the medical 
doctors on our research team). 

IV. Methods 

A. Basic sparse factor model 

Recall the factor model in (1); r defines the number of 
factors responsible for the data X, and it is not known in 
general, and must be inferred. Within the analysis we will 
consider K factors (K columns of A), with K set to a 
value anticipated to be large relative to r. We then infer 
the number of columns of A needed to represent the 
observed data X, with this number used as an estimate of 
r. Since we will be performing a Bayesian analysis, we will 
infer a posterior density function on r. Henceforth we 
assume A has K columns, with the understanding that 
we wish to infer the r <K columns that are actually 
needed to represent the data. 

Let cik represent the kth column of A, for k = 1, . . . , 
K, and ey and s, represent respectively the jth columns 
of E and S (with /=!,..,,«). Within the imposed 
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Figure 8 Set of important genes inferred for combined analysis of Flu, RSV and Rhino data, associated Factor 20 from the BP applied 
to the factor scores (Figure 6). Blue points correspond to samples of individuals who ultimately become symptomatic, and the red points 
correspond to asymptomatic samples. The first 220 samples correspond to the Flu data (encompassing a total of 108 hrs), the next 210 samples 
correspond to Rhino virus (encompassing a total of 96 hrs), and the remaining samples correspond to RSV (encompassing a total of 165.5 hrs). 



prior, vectors e,- and s y are generated as s, ~ J\f (0, I K ), 
and ~ W(0,diag(i//f\ y/p 1 )) ; I/c is the identity 
matrix and the precisions (yti, . . . , y/ p ) are all drawn i.i. 
d. from a gamma prior. 

One may consider many alternative means of defining 
sparseness on the a k , with the choice often dictated by 
convenience; we discuss two such methods here. In one 
approach [11] one may employ a spike-slab prior: 



A lk ~ w m S 0 + (1 - w lk )Af(0, a k l ), 
w l}l ~ Beta {a, b) , a fe ~ Gamma (c, d) 



(2) 



1, 



where (a, b) are selected as to strongly favor w !k 
S 0 is a distribution concentrated at zero, and 1=1,..., 
p. The advantage of (2) is that sparseness is imposed 
explicitly (many components of a k are exactly zero). 

An alternative to (2) is to employ a Student-t prior 
[6], implemented via the hierarchical construction 



A lk ~ J\f(Q, ay, 1 ) , a lk ~ Gamma [e, f) 



(3) 



but now with (e, j) selected as to constitute a Student- 
t sharply peaked about zero. One may employ a similar 
construction to impose a double-exponential (Laplace) 
sparseness-promoting prior [4]. 



B. Beta process for inferring number of factors 

The Beta Process (BP) was first developed by Hjort for 
survival data [28], and more recently it has found 
many other applications and extensions [19-21]. We 
here seek to provide a simple discussion of how this 
construction may be of interest in inferring an appro- 
priate number of factors in factor modeling [22]. Our 
goal is to use the BP construction, which is closely 
related to the Indian buffet process (IBP) [19-21], to 
infer the number of factors r based on the observed 
data X. 

Consider a measure drawn H ~ BP(a, /3, H 0 ) and con- 
structed as 

H = J ?t k S , n k ~ Beta(a / K, p{K - 1) / K), a k ~ H 0 (4) 

We seek to link our construction explicitly to the fac- 
tor model, and therefore is the Ath candidate factor 
loading (column of A), and H 0 is defined by the con- 
struction in (2) or (3), depending upon which model is 
used. The expression n k represents the probability that 
a k is used to represent any particular data sample, 
defined by the columns of X. The expression S ak repre- 
sents a unit point measure concentrated at a k . 
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The BP is closely linked with a Bernoulli Process BeP 
(H) [21]. Specifically, for the /th column of X, we per- 
form a draw from the Bernoulli process 

K 

B i = X 2 ^"*' Zt ' e * 1,0 '' Zh > ~ Bemoulli ( ,l 'i ! ) ' i =1 n (5) 

where the H in BeP(H) is drawn W ~ BP (a, P, H 0 ), as 
defined in (4). As discussed further below, if = 1 then 
a k is used as a factor loading to represent Xp the /th col- 
umn of X; if z k j = 0, a k is not used to represent Xj. In 
other words, Bj is a sum of point measures (S ak is a unit 
point measure concentrated at a k ), and the binary vari- 
ables z k j denote whether specific 8 ak are employed 
within Bj. More details on such constructions may be 
found in [21]. 

To make a connection to the introductory comments 
in Section II-A, and to relate the model to the IBP [19], 
we consider the above construction in the limit K —> °°. 
Further, we marginalize (integrate) out the probabilities 
(ji lt . . . , n K ) used to constitute the BP draw H; we 
retain the K candidate factor loadings {a k } k = hK used to 
define A, as drawn from the BP. Recall that Xj represents 
the /th data sample (/'th column of X). We assume that 
the data samples ("customers") select from among 
"dishes" at a "buffet", with the dishes defined by {a k } k = 
i ?K . Data sample x 1 enters the buffet first, and selects 
the first v 1 dishes where v l is a random 

variable drawn from Possion(a//3). Therefore, the first 
column of S has the first v 1 elements as non-zero, with 
the remaining elements in that column set to zero. The 
second "customer" x 2 then enters the buffet, and selects 
from among the first v l dishes; the probability that x 2 
selects a k , for each of k e {1, . . . , v x }, is 1/(13 + 1); i.e., 
z k2 ~ Bernoulli(l/(/3 + 1)), for k e {1, . . . , v x }. Customer 
x 2 also selects v 2 new dishes {a v \ +1 . . . , a v \ +v2 }, with v 2 
~ Possion(a l(P + 1)). Hence, z k2 = 1 for k e {vi + 1, . . 
• > v i + v 2}» and unless stated explicitly otherwise, all 
other components of z.are zero. This process continues 
sequentially, with each Xj entering the buffet in ascend- 
ing order of /. Sample Xj , with / e {1, . . . , n\ selects 

dishes as follows. Let ' x 



that it will be selected by Xj . Additionally, Xj selects 
new dishes a k for k L + !,..., Cj_\ + Vj), where 



Vj ~ Poisson 



f ^ r Therefore we have z ki j = 1 

for k e {C/_! +1, . . . , C/_i + vj\. Thus, each new custo- 
mer selects from among the dishes (factor loadings) 
already selected by at least one previous customer, and 
the more "popular" one of these dishes is, the more 
probable it is that the new customer will select it. 
Further, a new customer will also select additional 
dishes (factor loadings) not selected by any of the pre- 
vious customers. However, note that as / increases, the 



draws v ] ~ Poisson 



are likely to be 



decreasing in size, since ^ - -j _ ^ is getting smaller 

with increasing /. Therefore, although K — > °°, a finite 
subset of the candidate dishes (factor loadings) {a k } k = lf 
K will be used among the n customers, defined by the 
columns of X, thereby imposing sparseness in the use of 
factor loadings. This model is also fully exchangeable, in 
that the order of the columns of X may be permuted, 
with no change in the properties of the prior [19]. The 
model imposes that many of the n samples will share 
the same set of factors, but the model is flexible enough 
to allow idiosyncratic (sample-dependent) factor usage. 

In practice K is finite, and therefore it is also if inter- 
est to consider the properties of this prior for finite K. 
For finite K, one may show that the number of non- 
zero components of Zj is drawn from Binomial(/C, a /{a 
+ P(K - 1))), and therefore one may set a and P to 
impose prior belief on the number of factors that will be 
important. The expected number of non-zero compo- 
nents in Zj is aK/[a + P(K - 1)]. 

To complete the model specifications, note that a k 
from the Beta-Bernoulli construction above defines the 
kt\\ column of the factor-loading matrix A. The factor- 
score matrix S utilizes the binary vectors z, = (zy , . . . , 
z K j) T defined in (5), for / e {1, . . . , «}. Specifically, we 

define the /th column of S as Sj = Sj ° Zj (° represents a 
2ij=i v ) re P resen t the p 0 i n t-wise, or Hadamard product), with Sj ~ jV(0, 1^) . 



cumulative number of dishes selected off the buffet, 
among the previous customers {xi, . . . , Further, 
let W/_i ;<r > 1 represent the total number of times dish 
a k has been selected by previous customers {x lt . . . , Xj 

for k e {1, . . . , C/_i). Then Xj selects dish a k , k e the columns of S are sparse. 
{1, . . . , C/_i}, with probability rtij-i ik /(P + J - 1); i.e., 



The vector product Sj ° Zj selects a subset of the com- 
ponents in £ , setting the rest to zero, and therefore 

* J 



"k,j 



Bernoulli 



m 



J-U 



P + J-l 



for k 



(1. 



Note that the more "popular" a k among the previous / - 
1 customers {i.e., larger mj_ lik ), the more probable it is 



C. Sparse factor modeling with BP/IBP placed on factor 
loadings 

In the above discussion the Beta-Bernoulli and IBP pro- 
cesses were presented for a specific construction of the 
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factor-analysis model, with the goal of making the con- 
nection to the model explicit and hence clearer. How- 
ever, there are alternative ways of utilizing the IBP for 
design of factor models. Specifically, rather than using 
the binary vectors to construct S, as above, they may 
alternatively be used to define A, with factor scores 
designed as in traditional factor models. This approach 
was considered in [8], using an Indian Buffet Process 
(IBP) construction (explicitly using the marginalization 
discussed above). A limitation of this approach is that 
one must perform p draws from the IBP to construct A, 
and typically p is very large for the gene-expression pro- 
blems of interest. When presenting results in Section 
II-B, we discuss our experience with this model on 
small-scale problems, although this approach was found 
computationally intractable for the motivating virus stu- 
dies considered in Section II-D. 

D. Constructing pseudo singular values 

The final Bayesian construction considered for inferring 
r is closely related to the non-Bayesian sparse-PCA [12] 
and penalized matrix decomposition (PMD) [13] mod- 
els. We generate the vectors {a k } k = hK as before, using a 
sparseness-promoting prior like that discussed in Sec- 
tion IV-A. Further, the factor scores £ k f° r factor loading 

£ is drawn g k ~ J\f (0, I„), for k = I, ... , K;^I constitu- 
tes the kth row of S, and we consider K such rows, for 
large K (relative to the anticipated r). Finally, the vector 
of pseudo singular values X = (X lt . . . , X K ) is generated 

X = z ° w 

z k ~ Bernoulli(7r fe ) , k = 1, K 

n k ~ Betafa / K, j}K / {K - 1)) , k = 1, K (6) 

w ~7V(0, IjJ 

The matrix product AS in (1) is now constituted as 

XffifJ^ ■ The non-zero components of X select 

the columns of A used across all columns of X. As dis- 
cussed in Section IV, the number of non-zero compo- 
nents of X is drawn Binomial(/<", al(a + ji(K - 1))), and 
the posterior on the number of such components pro- 
vides desired information on the number of factors r. 
Note that this construction is like the Beta-Bernoulli 
process discussed above, in that it utilizes n k ~ Beta(a/ 
K, fSK/(K - 1)) and the Bernoulli distribution; however, 
it only draws the binary vector z once, and therefore 
there is not the idea of multiple "customers", as in the 
two IBP-related formulations discussed above. 



E. Computational issues, model quality and hyper- 
parameter settings 

The MCMC results presented here correspond to using 
5000 collection samples, after a burn-in of 2000 itera- 
tions. However, with 2000 burn-in iterations and 500 
collection samples, the average results of the factor 
scores and factor loadings are almost identical to those 
found with 5000. For all MCMC results, we employed a 
singular value decomposition (SVD) of the data matrix 
to initialize the factor loading and factor score matrix in 
the FA model, as well as the right-and left-singular 
matrix in the matrix decomposition model. For each 
iteration of the Gibbs sampler a particular number of 
factors r are employed, and based upon all collection 
samples one may infer an approximate posterior distri- 
bution for r. Running on a typical modern PC, the com- 
putation times are summarized in Table 1 for the 
different models, as applied to the influenza data (using 
100 VB iterations). 

To be explicit, we provide detailed hyper-parameter 
settings for the model in (7)-(14); the other models are 
set similarly. Specifically, a = 1, /} = 1, c = 1, and d = g 
= h = e = f = 10 6 . These parameters were not opti- 
mized, and were set in the same manner for all experi- 
ments. Although the PMD model is a non-Bayesian 
method, it also has parameter settings that must be 
addressed carefully; two hyper-parameters need adjust- 
ing: the sparseness threshold and the stop condition 
[13]. In all PMD experiments, we set the sparseness 
threshold as 4, and the PMD iterations were terminated 
when the reconstruction error was smaller than 5%. 

All calculations were performed on PCs with Intel 
Pentium Dual E2200 processors and 2.00 GB memory, 
and all software was written in Matlab. For the large- 
scale analysis performed on the real data discussed 
above, MCMC required approximately 4 hours of CPU, 
while VB required 3 hours (per analysis). 

Appendix: Gibbs and Variational Bayesian 
Analysis 

We here provide a concise summary of the inference 
methods applied to one of the Bayesian FA models dis- 
cussed above, with this representative of the analysis 
applied to the rest. Specifically, we consider the model 

Table 1 Relative CPU times of the different models, 
implemented on a pc, as applied to the influenza data, 
the pmd method required a few minutes 

CPU Time VB CPU Time MCMC 

(hours) (hours) 

BPFA 0.5 4.87 

Bayesian Pseudo- 0.1 1 3.47 

SVD 

FA in [11] 0.11 4.87 
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discussed in Section IV-B, in which the BP is applied 
within the factor-score matrix. The complete model 



may be expressed as 

x t ~ ^(A(z, » Sj ), dia^" 1 ¥p 1 )) (7) 

z fa ~Bernoulli(7r fe ) (8) 

n k ~ Beta(a / K, f5{K - 1) / K) (9) 

A jk -mo.rjk) (10) 

s f ~M(0,S- 1 l K ) (11) 

Yj k ~ Gamma(c, d) (12) 

yfj ~ Gamma(g, h) (13) 

(5 ~ Gamma(e, f) (14) 



where i = 1, ...,«,/' = 1, . . . , /? and A: = 1, . . . , K 
Gibbs sampler 

The full likelihood of the model is 

p(X,Z,A,S,\ji) = ^QA/"(x j ;(A(z 1 .o Sj )),diag(v)" 1 )A/'(Sj;0,5" 1 I K ) 
i=i 

p K 

X ] in^fA^ja^-^Gamma^jcd) 

X nn&rnoulli( Zii ,; * t )Beta(* t ; |, ^-^) 
i=i fe=i 
P 

X Gamma(i//j ; g, h) X Gamma(5; e, /) 
)=i 

The sequential update equations of the Gibbs sampler 
are as follows. 

♦ Sample each entry of the binary matrix, z ki . The 
probability of z ki = 1 is expressed as 

p(z ki = l\X,Z_ kil A,S,y,)oc\n(x k ) 
-^(AjfdiagftOAfcsjH - 2A fe T diag( V /)Xr fe s fei ). 

♦ Sample n k from p(jtk\-) = Beta(n k ;a',j3') where 



• Sample each entry of factor loading matrix, A y(t from 
p(Ajk\~) = M {Ajk> P/k, where 

£ jk =[Z" = l V 'i S ki Z ki +), jk j ■ ^jk = £ jk(X" = i v 'j Z ki s ki X ri k ' ' and 

X ji* = x ji-Y<i=iMk A fl Z >i S « ■ 

• Sample each column of factor score matrix, 
s t , from />(s/|-) = M {sf, \ u A,) where 

A, = [(A T o Zj)d!ag(v/)(A ° Z.J) + = A,(A o Z,)diag(y )x, , and 

Z; = [z ; , z,] with the 7<T- dimensional vector, z„ 
repeated times, \ < i < n. 

• Sample (//,• from p(i// ; - | -) = Gamma(fj ; gj, h-) 

where g'j = g + ^ and 

• Sample Yjk from p(jj k \-) = Gamma (jj k ; c', d*) where 
c = c+1/2 and d' = d + ^Aj k . 

• Sample 5 from p{§\-) = Gamma (SV, /) where e' = e 

1 v-i T 

+ nK/2 and /' = / + ^ (Sj s,0 In the above equations 

2 i=l 

expressions of the form p(jj k \-) represent the probability 
of Yj k conditioned on all other parameters. 

Variational Bayesian inference 

We seek a distribution Q(0; T ) to approximate the 
exact posterior p(0|X), where in 0 ={A,S,Z,a,n,yy,y,S} 
Our objective is to optimize the parameters T in the 
approximation Q(0; T). Toward that end, consider the 
variational expression 

F(T) = J«»Q(0; T) In p( x^ x) = " P(X) + KL(Q(B; r) 1 1 p(0 | X)) (15) 

Note that the term p(X) is a constant with respect to 
T, and therefore F(T) is maximized when the Kullback- 

Leibler divergence 7<Z(Q(0; r)||/?(0|X)) is minimized. 
However, we cannot explicitly compute the KL diver- 
gence, since p(@\X) is unknown. However, the denomi- 
nator term in F(T) may be computed, since p(X)p(@\X) 

= p{X\&)p{&), and the prior p{&) and likelihood func- 
tion p(X\&) are available. To make computation of 

F(T) tractable, we assume Q(0|T) has a factorized form 
Q(0; T) = n,Q,(0i; T t ). With appropriate choice of Q„ 
the variational expression F[T) may be evaluated analy- 
tically. The update equations are as follows. 
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♦ For z ki we have Q(z ki ) = Bemoul li(z fa ; p' ki ) where 

p' fa is the probability of z ki = 1. We consider the follow- 
ing two conditions: 

discussion below, the symbol < ♦ > represents the 
expectation of the argument. 

In Q(z ki = 1) oc ?1 = (\n„ k ) - [\(-2(X; k ) T _ 

diag(^>XA,X«, fej > + («; fe 2 ;)(A k T diag( V /)A,»] 
lnQ(z fe! =0)oc^ 2 =<ln(l- Wfe )) where 

<i n p - „,j> = * [ + » - 2^ <*>)-* ( ""T" 1 ' + " ) *M = £ ln r W , 

(ln(l - ».» = T jg^il + „ -^ i(zm > " + - *> + » ], TW = jLh.lt,) 

and r 1 (x) = J chT x ~ x e~ T ■ Therefore, we can calculate 

Pw = g^(CJ) Above, and in the discussion 

below, the symbol < • > represents the expectation of 
the argument. 

♦ For n k we have Q[n k ) = Beta(w fe ;a' k ,p' k ) where 

♦ For A y * we have Q(A jk ) = A/"(A jfe ; with 

^ = [XL^iX s «>< z H> + <yjk>] _1 and 

A<jfe = ^;fe(X!=l^j Z fe' S fei X /^ 

, where xf = x jt - Xi=i;Z*fc A /fi s '<' • 

♦ For s, we have Q(s,) = A/"(s,-; A,) , with 
A ; = [<(A T o Z i )diag(V»(A o Z, T )> + (S)l p 1 and 
§ i = A j («A T >o<Z i »diag«^»* j ) , where 
Z i = [z { , z,] is a JC-dimensional vector of all 

repeated ^ times. In order to exactly calculate the 
expectation, 

B = ((A T oZ ; )diag( V /)(AoZ, T )) 

, we have to consider it as two parts. Specifically, the 
off-diagonal elements of B are 

«A T )o(Z i ))diag«i/))«A)o(Z, T )) , and the diagonal 
elements, B kk = (X P . =1 {(A jk ) 2 + T. jk )(y/ j))(z ki ) , since 



(Ap = (A jk ) 2 +Z jk and (A 2 k ) = (A jk ) 2 + Z jfe , where 
l<k<K,l<j<p and 1 < i < n. 

• For yjj we have Q{y/ j) = Gamma(y/j : ; gj, hj) , 

where g] = g + | , h • = h + ± XL ~ A j( z i ° 5 i)^ • 

• For 7 yit we have Q(/ jfe ) = Gamma(Cj fe , d jfe ) , with 

Cj- fe =c + l/2 and d jk = d + ±(A 2 k ). 

• For (5 we have Q(S) Gamma (e',f), where e' = e + Kn/ 
2 and f = f + \YiJ s ^i) ■ 
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